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Abstract 

The computational effort in the calculation of Wilson fermion quark 
propagators in Lattice Quantum Chromodynamics can be considerably 
reduced by exploiting the Wilson fermion matrix structure in inversion 
algorithms based on the non-symmetric Lanczos process. We consider 
two such methods: QMR (quasi minimal residual) and BCG (biconju- 
gate gradients). 

Based on the decomposition M/k = 1/k — D of the Wilson mass 
matrix, using QMR, one can carry out inversions on a whole trajectory 
of masses simultaneously, merely at the computational expense of a 
single propagator computation. In other words, one has to compute 
the propagator corresponding to the lightest mass only, while all the 
heavier masses are given for free, at the price of extra storage. 

Moreover, the symmetry 75 M = 75 can be used to cut the 
computational effort in QMR and BCG by a factor of two. We show 
that both methods then become — in the critical regime of small quark 
masses — competitive to BiCGStab and significantly better than the 
standard MR method, with optimal relaxation factor, and CG as ap- 
plied to the normal equations. 



1 



1 Introduction 



Lattice QCD allows to compute physical quantities like the hadronic spec- 
trum, weak decay constants and other weak matrix elements without re- 
course to perturbation theory ||]. Basic building blocks for the construction 
of such observables are the quark propagators, i.e. the Green's functions of 
the fermionic operator. In practice, the latter are determined via an iterative 
inversion method on an ensemble of gauge background fields, generated in a 
Monte Carlo process. Thus, the evaluation of quark propagators represents 
a major task within this branch of elementary particle theory. 

One can work in two directions in order to achieve good efficiency com- 
puting the propagators: 

• acceleration of the convergence using improved iterative procedures 




• exploitation of the structure of the matrix M in the implementation 
of the inverter. 

Of course, these two directions are not mutually exclusive. 

In this note we go one step into the second direction and point out how 
the structure of M can be exploited in iterative methods based on the non- 
symmetric Lanczos process Qj. Specific methods in this class comprise the 
BCG (biconjugate gradient) method of Fletcher j|] and the more recent 
QMR (quasi minimum residual) method of Freund and Nachtigal ||. We 
show that these Lanczos based methods can be very useful in carrying out 
the extrapolation to the chiral limit: one can perform inversions on a whole 
trajectory of masses simultaneously. In other words, one has to compute 
the propagator corresponding to the lightest mass only, while all the heavier 
masses are almost given for free, at the price of extra storage. 

Moreover, we shall also point out how the particular symmetry properties 
of the Wilson fermion matrix can be used to further reduce the costs of each 
iterative step in QMR or BCG by a factor of 2. This fact, mentioned in 
Ref. JjJ has recently attracted attention in the lattice QCD community. 

After discussing the basic properties of the Wilson fermion matrix in 
Section 2, we will present the QMR and BCG algorithms in quite some de- 
tail (Section 3) and explain the savings in computational effort due to the 
structure of the Wilson fermion matrix. Section 4 contains the results of our 
numerical experiments on the CM5 parallel computer at Wuppertal univer- 
sity. In particular, we will compare the QMR algorithm with BiCGStab 
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[3], BCG and with the standard over-relaxed MR (minimal residuum) 
method || on realistic configurations for different values of k, approaching 
the critical regime of very small relative quark masses. All these calculations 
were done using standard odd-even preconditioning. 

2 Basic properties of the Wilson fermion matrix 

The Dirac operator in its discretized form as given by Wilson reads: 

M = 1 - kD, (1) 

with the off-diagonal hopping term 

4 

D x, y = H( 1 ~ 7m) u ^ x ) 5 *>,y-» + (! + In) u t( x ~ M) ( 2 ) 

11=1 

In Eq. [|, the {U^(x)} represent the gauge background field on a four-dimen- 
sional Euclidean space-time lattice. 

In the following it is preferable to scale M by a factor -: M — > -M. We 
shall thus consider the solution of the linear equation 

Mx = (-1 - D)x = (j). (3) 

The matrix M = ^1 — D in Eq. ^ has two important properties which 
will be crucial to algorithms for solving Eq. ||: 

• M is 75-symmetric, i.e. 

M 75 = 75 

where 75 is the permutation matrix which commutes the Dirac com- 
ponents 1 with 3 and 2 with 4 on each lattice site. In particular, 75 is 
unitary and hermitian: 

75 = 75 = 7s" 1 • 
Multiplying a vector by 75 is a very cheap operation. 

• M is a shifted matrix with respect to its dependence on k, i.e. M is 
the sum of a multiple of the identity and a constant off-diagonal part 
D. 
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When building up M one has the freedom to choose any ordering scheme 
for the lattice sites. Subdividing the lattice in a checkerboard style into even 
and odd sites and numbering all even sites before the odd ones results in 
the following two-cyclic structure for M 



M 



(1/k)1 
-D nP 




(4) 



Using the indices 'e' and 'o' to denote even and odd sites, respectively, we 
can thus rewrite Eq. as 



;i/«)i -D eo 

-D oe (1/k)1 




This equation separates into 



M e x e _ 



where 



Mr 



b Q -\- D oe x e 



D m D„ 



(5) 
(6) 



(7) 



-(j) e + DoeC, 
K 



'0 • 



(8) 



Eq. 5 is called the odd-even preconditioned system jll], 12], with M e given by 
Eq. 7. It has become standard to solve the odd-even preconditioned system 
rather than the original one since iterative methods for Eq. [| converge faster 
than for Eq. ||. 

In our context, it is very important to notice that the matrix M e of 
the preconditioned system conserves both basic properties of M as stated 
before, i.e. M e is still 75-symmetric and a shifted matrix (with factor 1/k 2 
instead of 1/k). Therefore, using QMR or BCG we can take advantage of 
the particular structure of M e in just the same manner as with M. However, 
one should be aware of that the new source term cj) e in Eq. |8| will depend on 
k as soon as 4> e 7^ 0. This must be accounted for when exploiting the shifted 
structure of M e . Details will be given in the next section. 
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3 QMR and BCG 



We consider two different iterative methods for solving Eq. ||: The BCG 
(biconjugate gradient) method of Fletcher []|] and the QMR (quasi mini- 
mum residual) method of Freund and Nachtigal ||. Generic formulations 
for these methods are given in Algorithms |l] and [2], see also Ref. [13]. Al- 
though QMR looks somewhat more involved than BCG, both methods re- 
quire approximately the same computational work per iteration: One matrix 
multiplication with M and another with M', the additional scalar and vec- 
tor operations being negligible in either method. QMR will usually reduce 
the norm of the residual in a much smoother manner than BCG does, thus 
making QMR more stable numerically. 



choose r £ C, set p° = r u = <j) - Mx 1 
choose r° £ C ra , set pP = r° 
for m = 0, 1, . . . 

x m+l =x m + §mp m 
r m+l =r m_ S m M P m 

„m+l — r m+l I „ >m 

P — ' ~r l->rnt> 

r 1 ' ' rmf 



Algorithm 1: BCG method. 

The vectors r° in BCG and w° in QMR can be chosen freely. If we 
take r° = up, it can be shown [|l4|] that r^ CG is a scalar multiple of v m 
and f" 1 is a scalar multiple of w m . (We use the subscripts BCG and QMR 
to distinguish between quantities which are otherwise denoted by the same 
symbol in both methods). Thus, QMR and BCG are intimately related and 
one can show that |l~4| l 

X BCG = X QMR - Vm—PQJMR ( 9 ) 
t ~"m 

and 

WbcgW = \\r%- \si...s m \ •— • (10) 
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choose x° £ C n , set v° = b - Mx° 

choose uP £ C n 

{initialize} 

set hq = ||w°||,(5o = l,c_i = c = l,s_i = s = , 
set p^ 1 = p~ 2 = i;" 1 = = 
for m = 0, 1, . . . 

{next Lanczos step} 

Pm = \\v m \\, Vm = \\w m \\ 

v m = v m /p m ,w m = \\w m \\r hn 

5 m = ( w m )h m 

a m = (w m )^Mv m /5 m 

0m = Vm^m/^m—l,^fm = Pm$m/ $m— 1 

v m+l = Mv m - a m v m - Prnv" 1 ' 1 
^m+i = M ] w m _ a mW m - 7 m w m_1 
{update QMR recurrence coefficients} 

Set @ m +i = S m —if3 m , £rrt+l = Cm— iPmi &m+l = Cm&m+1 ~\~ s mC^m, 
$m+l = —JSmMm+1 + C m a m , V m +1 = (\S m +i\ 2 + |7 m +i| 2 ) 1//2 , 
C-m+1 = \&m+l\/v m +l, S-rn+l = if <5 m+ i = 0, 

/$m+l if <5m+l / 0, 
<5m+l = C m +l<5 m+ i + S m+ i7 m+ i 

{update QMR iterate} 

set p m = (v m - e m+lP m - 1 - e m+lP m - 2 )/5 m+1 



Pm — C-m+lPm 
x m+l =x m + ~ m 

pm+1 = Sm+lpm 



Algorithm 2: QMR method. 
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So, the BCG iterates can be retrieved very easily from the QMR iterates. 
The generation of v m and w m as given in the QMR algorithm is called the 
non- symmetric Lanczos process. This process is the common basis of QMR 
and BCG. 

Both methods, QMR and BCG can break down prematurely due to 
b m = in QMR or zero divisors in p m or 5 m in BCG. This is why the state-of- 
the-art package QMRPACK Jl5| (distributed freely through the netlib server 
netlib@ornl.gov) contains modifications of QMR in which 'look-ahead' 
Lanczos steps |l6| are incorporated. This avoids premature breakdowns at 
the expense of extra storage and it further enhances the numerical stability. 



choose r £ C n , set p° = r° = <j) - Mx 1 
for m = 0, 1, . . . 

S m = (7 5 r m ) t r m /(^ m )t( 75 p m ) 

x m+l =x m + §mp m 

r m+l =r m_ S m Ap m 

p m = ( 75r m+l)t r m+l/( 75r m)t r ™ 

r.rn+1 — r m+l _i_ _ „m 



Algorithm 3: BCG exploiting the 75-symmetry. 

If we choose f° = 7sr° in BCG, an easy calculation shows that due to 
the 75-symmetry of M we have r" 1 = 7sr m for all m and the scalars <5 m , p m 
in BCG are all real. Consequently, the computational effort per iteration 
reduces to only one matrix multiplication, see Algorithm ||[ Similarly, if we 
take w° = 75V in QMR, the non-symmetric Lanczos step yields w m = ^v m 
for all m and the scalar quantities in QMR again become all real. This 
simplified version of the Lanzcos process is given in Algorithm |||. So, once 
more, the multiplication with can be saved, reducing the computational 
effort in QMR by a factor of two. In the more general case of so-called 
P-symmetric matrices, the above simplifications were already described in 
Ref. 

Additionally, we note that M being a shifted matrix, M = o~l — D with 
a = 1/k , we can rewrite the generation of v m in the non-symmetric Lanczos 
process of Algorithm |] as 

v m = —Dv m ~ l - (a m _! - a)v m ~ l - {5 m ^v m ~\ 



7 



choose v 



set t> 1 ■ 


= 0, 5 = 1 


for m = 


0,1,... 


Pm — 


\\v m \\ 


v m = 


V m / Pm 


&m — 






(j 5 v m )^Mv m /5 m 


— 


Pm^raj &m—l 


~m+l 


= Mv m - a m v m - (3 m v m 



Algorithm 4: The non-symmetric Lanczos process exploiting 75-symmetry. 
with 

a m _! - a = ( 75 ^ m - 1 ) t (-£')« m - 1 /^-i. 

This shows that the Lanzcos vectors v m and v m depend only on D but not 
on a, provided one always has the same initial vector v°. The recurrence 
coefficients (3 m , T] m ronicLin unchanged whereas the coefficient otm 

changes to 

ot m + o" if we change the matrix from —D to al — D. 

Consequently, if we simultaneously solve several systems 

(—1 - D)xi = (f>, i = l,...,l 

using QMR, the Lanczos part itself has to be performed only once, provided 
we take the initial guess x® = for i = 1, . . . , I. In fact, x° = for i = 
1, . . . , I leads to the same initial residual = (j) for i = 1, . . . ,1 so that 
v° = (f> represents the initial vector of the Lanczos process for all i. These 
observations go back to Ref. |18[ |. 

Note that the odd-even preconditioned system Eq. || has the same shifted 
structure as the original Wilson- Fermion matrix with a = 1/k 2 . The right 
hand side <p e = ^4> e + D oe (f)Q, however, will usually depend on k so that 
it is impossible to easily find initial guesses which lead to the same initial 
residual. In that case we propose to consider the two systems 

M e Z e = D oe <j) , 1 J 

which now both leave an initial residual that does not depend on k if the 
initial guess is taken to be zero. Hence, again, for each of the two systems 
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the non-symmetric Lanczos process needs to be performed only once for 
different values of k. 

So, in the QMR method we can treat several values of k at the same 
time without introducing any additional matrix multiplications. This also 
holds for the BCG iterates if we compute them from the QMR algorithm 
via Eq. [9|. The price to pay is 4 vectors extra storage for each additional 
value of k. This can be reduced to 3 vectors if we refrain from updating the 
residuals. Indeed, it was shown in Ref. H that 

||r m || < ||r || 2 • \Jm +1 • • • s m \ =: T m . 

The scalar r m can be updated easily, and since it represents an upper bound 
for ||r m || it can be used in a stopping criterion. It has turned out to be a 
good choice checking for r m < 10 • e if one wants to have ||r m || < e, but, 
of course, the latter inequality should be re-checked by explicitly computing 
r m and ||r m || once r m < 10 • e is fulfilled. 

Finally, we just mention that all the above simplifications remain valid 
if we incorporate the look-ahead Lanczos process fl8||. 



4 Results 

Our numerical computations were done on the CM5 parallel computer at 
Wuppertal university. We tested and compared five different methods for 
solving the odd-even preconditioned system Eq. [|: QMR and BCG as de- 
scribed before, the standard over-relaxed MR (minimal residual) method, 
the BiCGStab method and the usual conjugate gradient method applied to 
the normal equation 

M\M e x e = M\~4> e . 

This method is abbreviated CGNE. 

In our comparative study, we set high value on trying to be close to 
realistic lattice gauge applications: at = 6.0, we have generated an 
ensemble of 10 decorrelated quenched gauge configurations on a lattice 
of size of 16 4 , see also Ref. ||. We worked with a series of k- values, 
k = 0.152, 0.153, 0.154, 0.155, 0.1553, that corresponds to a quark mass range 



0.1 > niq > 0.03. In Refs. |T^, ^(J, 21 1, propagator computations on these 



k- values resulted in a nearly linear pion mass trajectory as function of 1/k. 

As is well known that the convergence behaviour of iterative solvers can 
depend on the source vector, we adapted the Wuppertal source smearing 



9 



method |22| to build our source: starting from a location x on a given time- 
slice the smeared source is generated according to the smearing procedure 

0(£,*)->^(£,t) ' 



1 — 6a 

3 



^4>{x,t) + \Ui(x,t) 4>(x + ei,t) + uj(x — (%,t) <j>(x - e%,t) }, 

(12) 



with the unit vector e*j pointing in spatial direction i. We used a = 4 and 
performed 100 smearing iterations. 

Our first diagram (Figure |l]) reports result obtained on one given config- 
uration of the ensemble with a smeared source and k = 0.155. We display 
the convergence history for the different methods by plotting the norm of 
the residual (normalized to ||r°|| = 1) against the number of matrix multi- 
plications involved. Since in either method all the computational effort is 
very highly concentrated on the matrix multiplications, Figure [l] may also 
be interpreted as giving the norm of the residuals as a function of comput- 
ing time. Here, a matrix multiplication is a multiplication with M e or MJ. 
Note that each iterative step of CGNE and BiCGStab requires two matrix 
multiplications, whereas MR, QMR and BCG require only one since we 
take advantage of the 75-symmetry. In all methods, our starting vector was 
the zero vector. In BiCGStab, the 'shadow residual vector' r° was chosen 
equal to the inital residual, which means that we used BiCGStab exactly as 
described in Ref. ||. 

Figure [l| shows that CGNE is by far the slowest method and that MR also 
performs substantially worse than the remaining three methods, although 
we used the optimal relaxation factor in MR. This factor was found to 
be 1.1 by numerical experimentation. Figure [l] also illustrates the wide 
fluctuations in the residual norm which one usually observes in BCG. On 
the other hand, QMR converges very smoothly. The speed of convergence 
of all three methods, QMR, BCG and BiCGStab looks quite comparable, 
with BiCGStab being slightly better at the beginning, whereas BCG and 
QMR perform a little better towards the end. To reach the required residual 
norm of 10~ 10 , QMR and BCG were the fastest of all methods with QMR 
performing some 10% better than BiCGStab. If we had not made use of 
the 75-symmetry, QMR and BCG would have taken twice the number of 
matrix multiplications so that both methods would become clearly inferior 
to BiCGStab. This is the reason why, based on computations on a cold 
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Figure 1: Convergence history for different methods 
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and a hot model configuration and without exploiting 75-symmetry QMR 
and BCG were regarded as less efficient than BiCGStab and its variant 
BiCGStab2 in Ref. g. 

We now turn to demonstrate the additional savings possible by exploiting 
the 75-symmetry together with the shifted structure of the Wilson fermion 
matrix. We compare QMR-MULT, the QMR method for solving Eq. [| om 
an entire set of ft-values simultaneously (so the Lanczos process is carried out 
only once), with a standard sequential computation on these k values. To 
make a fair comparison, we used the educated guess technique in the latter 
case, i.e. the final result of a computation for which the previous k was 
taken as the new starting vector for the next k. For this serial treatment 
of the different values of k we tried CGNE, over-relaxed MR, BiCGStab 
and QMR itself. For these comparisons we used the whole ensemble of 10 
configurations. The results for CGNE were so inferior in comparison to the 
other methods that we decided to not include them into our diagrams. 

Figure [| gives the total number of matrix multiplications m as a func- 
tion of the number of k's for which the calculations have been done. More 
precisely, a value of i on the horizontal axis refers to the calculation of i 
different values K\ , . . . , Kj, where 

nx = 0.152, k 2 = 0.153, k 3 = 0.154, k 4 = 0.155, k 5 = 0.1553. 

In Figure ^, the source term was taken to be a point source. The initial 
vector was the zero vector for all k's in QMR-MULT and for the first value 
of n in all other methods. As a consequence, a look-ahead Lanczos step 
had to be performed in QMR and QMR-MULT at the very beginning of the 
iteration due to (750 e ) <fie = 0. We used a simple modification of Algorithm || 
to perform this look-ahead stepS. 

In either method the iteration was stopped when the norm of the resid- 
ual, weighted by the norm of the source term (which is the inital residual 
if we take starting vector zero) was less than 10~ 10 . The representation in 
Figure Q shows the average values over the whole sample of 10 configura- 
tions. The deviation of the results for the individual configurations from the 
average was quite small, ranging from less than 2% for small values of k to 
never more than 10% for the largest k in either method. 

Figure ^ refers to exactly the same computations as Figure showing 
now the computing time on the CM5 as a function of the number of k's. A 

1 A more elaborate implementation of the QMR method exploiting 75-symmetry and 
the shifted structure based on QMRPACK is currently under development. 
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Figure 2: Total number of matrix multiplications as a function of the number 
i of the n's for which the calculations have been done. 

comparison with Figure |2| very clearly establishes that the computing time 
is proportional to the number of matrix multiplications. Of course, in QMR- 
MULT, the additional work to be spent in updating the iterates increases as 
we treat more k's simultaneously, but this only minorly affects the overall 
computing time. 

Figure |2] and ||] show that the more values of k we treat simultaneously 
in QMR-MULT, the more we gain in computing time against the educated 
guess variants. While for the case of one single k all methods perform 
comparably well, the 'several-on-one-stroke' approach pays out as soon as 
we treat 2 or more values of k simultaneously. For 5 values of k, QMR-MULT 
is almost three times as fast as the most rapid educated guess method (via 
BiCGStab). 

As was pointed out at the end of Section 2, using other sources than 
point sources will usually require QMR-MULT to be performed on two sys- 
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Figure 3: Computing time on the CM5 normalized to the time needed by 
QMR-MULT as function of the number i of k's for which the calculations 
have been done. 



terns with different source terms, whereas this is not necessary for the other 
methods. In that case we thus expect the QMR-MULT approach to become 
approximately two times more slowly. Nevertheless, it would be faster than 
all methods based on the educated guess as soon as we treat 4 or more values 
of n simultaneously. For 5 values of k, for example, QMR-MULT will still 
be some 50% better than BiCGStab. 



5 SUMMARY 

In this note we have presented and tested an extension of the QMR algo- 
rithm, the QMR-MULT, which exploits structural and symmetry properties 
of the Wilson Fermion matrix in order to speed up the inversions on a whole 
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mass trajectory. In the setting of a realistic application, we have shown that 
the QMR-MULT can save a factor two to four in computer time, compared 
to the fastest algorithms which are presently in use. 
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